{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# How to connect to Milvus from a notebook\n",
    "\n",
    "There are several different ways to start up a Milvus server.\n",
    "\n",
    "1. [Milvus Lite](#milvus_lite) is a local Python server that can run in Jupyter notebooks or Google Colab, requires pymilvus>=2.4.3.  \n",
    "   ⛔️ Only meant for demos and local testing.\n",
    "2. [Zilliz cloud free tier](#zilliz_free)\n",
    "3. [Milvus standalone docker](#milvus_docker) requires [local docker](https://milvus.io/docs/install_standalone-docker.md) installed and running.\n",
    "4. [LangChain](#langchain) - all [3rd party adapters](https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.milvus.Milvus.html) use Milvus Lite.\n",
    "5. [LlamaIndex](#llama_index) - all [3rd party adapters](https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.milvus.Milvus.html) use Milvus Lite.\n",
    "6. Milvus kubernetes cluster requires a [K8s cluster](https://milvus.io/docs/install_cluster-milvusoperator.md) up and running.\n",
    "\n",
    "💡 **For production workloads**, it is recommended to use Milvus local docker, kubernetes clusters, or fully-managed Milvus on Zilliz Cloud. <br>\n",
    "\n",
    "I'll demonstrate how to connect using the [Python SDK](https://github.com/milvus-io/pymilvus/blob/master/pymilvus/milvus_client/milvus_client.py). For more details, see this [Python example](https://github.com/milvus-io/pymilvus/blob/bac31951d5c5a9dacb6632e535e3c4d284726390/examples/hello_milvus_simple.py).  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Milvus Lite  <a class=\"anchor\" id=\"milvus_lite\"></a>\n",
    "\n",
    "Milvus Lite is a light Python server that can run locally.  It's ideal for getting started with Milvus, running on a laptop, in a Jupyter notebook, or on Colab. \n",
    "\n",
    "⛔️ Please note Milvus Lite is only meant for demos, not for production workloads.\n",
    "\n",
    "- [github](https://github.com/milvus-io/milvus-lite)\n",
    "- [documentation](https://milvus.io/docs/quickstart.md)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "pymilvus:2.4.3\n"
     ]
    }
   ],
   "source": [
    "# !python -m pip install -U pymilvus\n",
    "import pymilvus\n",
    "print(f\"pymilvus:{pymilvus.__version__}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Connect a client to the Milvus Lite server.\n",
    "from pymilvus import MilvusClient\n",
    "mc = MilvusClient(\"milvus_demo.db\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Successfully created collection: `MilvusDocs`\n"
     ]
    }
   ],
   "source": [
    "# Create a collection.\n",
    "COLLECTION_NAME = \"MilvusDocs\"\n",
    "EMBEDDING_DIM = 256\n",
    "\n",
    "# Milvus Lite uses the MilvusClient object.\n",
    "if mc.has_collection(COLLECTION_NAME):\n",
    "    mc.drop_collection(COLLECTION_NAME)\n",
    "    print(f\"Successfully dropped collection: `{COLLECTION_NAME}`\")\n",
    "\n",
    "# Create a collection with flexible schema and AUTOINDEX.\n",
    "mc.create_collection(COLLECTION_NAME, \n",
    "        EMBEDDING_DIM,\n",
    "        consistency_level=\"Eventually\", \n",
    "        auto_id=True,  \n",
    "        overwrite=True,\n",
    "    )\n",
    "print(f\"Successfully created collection: `{COLLECTION_NAME}`\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Successfully dropped collection: `MilvusDocs`\n"
     ]
    }
   ],
   "source": [
    "# Drop the collection.\n",
    "mc.drop_collection(COLLECTION_NAME)\n",
    "print(f\"Successfully dropped collection: `{COLLECTION_NAME}`\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Zilliz free tier  <a class=\"anchor\" id=\"zilliz_free\"></a>\n",
    "\n",
    "This section uses [Zilliz](https://zilliz.com), free tier.  If you have not already, sign up for a [free trial](https://cloud.zilliz.com/signup).  \n",
    "\n",
    "If you already have a Zilliz account and want to use free tier, just be sure to select \"Starter\" option when you [create your cluster](https://docs.zilliz.com/docs/create-cluster).  ❤️‍🔥 **In other words, everybody gets free tier!!**  \n",
    "- One free tier cluster per account.\n",
    "- Per free tier cluster, up to two collections at a time. (Think of a collection like a database table. Each collection has an index, schema, and consistency-level).\n",
    "- Each free tier collection can support up to 1 Million vectors (Think of this like rows in a database table).\n",
    "\n",
    "If you have larger data, we recommend our Pay-as-you-go Serverless or Enterprise plan.  Free tier and Pay-as-you-go are Zilliz-managed AWS, Google, or Azure services.  BYOC is possible in the Enterprise plan.\n",
    "\n",
    "### 👩 Set up instructions for Zilliz \n",
    "\n",
    "1. From [cloud.zilliz.com](cloud.zilliz.com), click **\"+ Create Cluster\"**\n",
    "2. Select <i>**Starter**</i> option for the cluster and click **\"Next: Create Collection\"**\n",
    "   <div>\n",
    "   <img src=\"../images/zilliz_cluster_choose.png\" width=\"60%\"/>\n",
    "   </div>\n",
    "\n",
    "1. Name your collection with a <i>**Collection Name**</i> and click **\"Create Collection and Cluster\"**.\n",
    "2. From the Clusters page, \n",
    "   - copy the cluster uri and save somewhere locally.\n",
    "   - copy your cluster API KEY.  Keep this private! \n",
    "     <div>\n",
    "     <img src=\"../images/zilliz_cluster_uri_token.png\" width=\"80%\"/>\n",
    "     </div>\n",
    "\n",
    "3. Add the API KEY to your environment variables.  See this [article for instructions](https://help.openai.com/en/articles/5112595-best-practices-for-api-key-safety) how in either Windows or Mac/Linux environment.\n",
    "4. In Jupyter, you'll also need .env file (in same dir as notebooks) containing lines like this:\n",
    "   - ZILLIZ_API_KEY=value\n",
    "5. In your code, connect to your Zilliz cluster, see code example below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Type of server: Zilliz Cloud Vector Database(Compatible with Milvus 2.4)\n"
     ]
    }
   ],
   "source": [
    "import os\n",
    "from pymilvus import (connections, MilvusClient, utility)\n",
    "TOKEN = os.getenv(\"ZILLIZ_API_KEY\")\n",
    "\n",
    "# Connect to Zilliz cloud using endpoint URI and API key TOKEN.\n",
    "CLUSTER_ENDPOINT=\"https://in03-xxxx.api.gcp-us-west1.zillizcloud.com:443\"\n",
    "CLUSTER_ENDPOINT=\"https://in03-8bc9fd463236b1a.api.gcp-us-west1.zillizcloud.com:443\"\n",
    "\n",
    "connections.connect(\n",
    "  alias='default',\n",
    "  uri=CLUSTER_ENDPOINT,\n",
    "  token=TOKEN,\n",
    ")\n",
    "\n",
    "# Check if the server is ready and get collection name.\n",
    "print(f\"Type of server: {utility.get_server_version()}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Successfully created collection: `movies`\n"
     ]
    }
   ],
   "source": [
    "COLLECTION_NAME = \"movies\"\n",
    "EMBEDDING_DIM = 256\n",
    "\n",
    "# Use no-schema Milvus client uses flexible json key:value format.\n",
    "# https://milvus.io/docs/using_milvusclient.md\n",
    "mc = MilvusClient(\n",
    "    uri=CLUSTER_ENDPOINT,\n",
    "    token=TOKEN)\n",
    "\n",
    "# Check if collection already exists, if so drop it.\n",
    "has = utility.has_collection(COLLECTION_NAME)\n",
    "if has:\n",
    "    drop_result = utility.drop_collection(COLLECTION_NAME)\n",
    "    print(f\"Successfully dropped collection: `{COLLECTION_NAME}`\")\n",
    "\n",
    "# Create a collection with flexible schema and AUTOINDEX.\n",
    "mc.create_collection(COLLECTION_NAME, \n",
    "                     EMBEDDING_DIM,\n",
    "                     consistency_level=\"Eventually\", \n",
    "                     auto_id=True,  \n",
    "                     overwrite=True,\n",
    "                    )\n",
    "print(f\"Successfully created collection: `{COLLECTION_NAME}`\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Successfully disconnected from the server.\n"
     ]
    }
   ],
   "source": [
    "# Drop collection\n",
    "utility.drop_collection(COLLECTION_NAME)\n",
    "\n",
    "# Disconnect from the server.\n",
    "try:\n",
    "  connections.disconnect(alias=\"default\")\n",
    "  print(\"Successfully disconnected from the server.\")\n",
    "except:\n",
    "  pass"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Milvus standalone Docker <a class=\"anchor\" id=\"milvus_docker\"></a>\n",
    "\n",
    "This section uses [Milvus standalone](https://milvus.io/docs/configure-docker.md) on Docker. <br>\n",
    ">⛔️ Make sure you pip install the correct version of pymilvus and server yml file.  **Versions (major and minor) should all match**.\n",
    "\n",
    "1. [Install Docker](https://docs.docker.com/get-docker/)\n",
    "2. Start your Docker Desktop\n",
    "3. Download the latest [docker-compose.yml](https://milvus.io/docs/install_standalone-docker.md#Download-the-YAML-file) (or run the wget command, replacing version to what you are using)\n",
    "> wget https://github.com/milvus-io/milvus/releases/download/v2.4.0-rc.1/milvus-standalone-docker-compose.yml -O docker-compose.yml\n",
    "4. From your terminal:  \n",
    "   - cd into directory where you saved the .yml file (usualy same dir as this notebook)\n",
    "   - docker compose up -d\n",
    "   - verify (either in terminal or on Docker Desktop) the containers are running\n",
    "5. From your code (see notebook code below):\n",
    "   - Import milvus\n",
    "   - Connect to the local milvus server"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "# !pip install -U pymilvus"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Pymilvus: 2.4.3\n"
     ]
    }
   ],
   "source": [
    "import pymilvus, time\n",
    "from pymilvus import (connections, MilvusClient, utility)\n",
    "print(f\"Pymilvus: {pymilvus.__version__}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "v2.4.1\n"
     ]
    }
   ],
   "source": [
    "####################################################################################################\n",
    "# Connect to local server running in Docker container.\n",
    "# Download the latest .yaml file: https://milvus.io/docs/install_standalone-docker.md\n",
    "# Or, download directly from milvus github (replace with desired version):\n",
    "# !wget https://github.com/milvus-io/milvus/releases/download/v2.4.0-rc.1/milvus-standalone-docker-compose.yml -O docker-compose.yml\n",
    "####################################################################################################\n",
    "\n",
    "# Start Milvus standalone on docker, running quietly in the background.\n",
    "# !docker compose up -d\n",
    "\n",
    "# # Verify which local port the Milvus server is listening on\n",
    "# !docker ps -a #19530/tcp\n",
    "\n",
    "# Connect to the local server.\n",
    "connection = connections.connect(\n",
    "  alias=\"default\", \n",
    "  host='localhost', # or '0.0.0.0' or 'localhost'\n",
    "  port='19530'\n",
    ")\n",
    "\n",
    "# Get server version.\n",
    "print(utility.get_server_version())\n",
    "\n",
    "# Use no-schema Milvus client uses flexible json key:value format.\n",
    "mc = MilvusClient(connections=connection)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Successfully dropped collection: `movies`\n",
      "Created collection: movies\n"
     ]
    }
   ],
   "source": [
    "COLLECTION_NAME = \"movies\"\n",
    "EMBEDDING_DIM = 256\n",
    "\n",
    "# Check if collection already exists, if so drop it.\n",
    "has = utility.has_collection(COLLECTION_NAME)\n",
    "if has:\n",
    "    drop_result = utility.drop_collection(COLLECTION_NAME)\n",
    "    print(f\"Successfully dropped collection: `{COLLECTION_NAME}`\")\n",
    "\n",
    "# Create a collection with flexible schema and AUTOINDEX.\n",
    "mc.create_collection(\n",
    "        COLLECTION_NAME, \n",
    "        EMBEDDING_DIM, \n",
    "        consistency_level=\"Eventually\", \n",
    "        auto_id=True,  \n",
    "        overwrite=True,\n",
    "        )\n",
    "print(f\"Created collection: {COLLECTION_NAME}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[33mWARN\u001b[0m[0000] /Users/christy/Documents/bootcamp_scratch/bootcamp/docker-compose.yml: `version` is obsolete \n",
      "Successfully disconnected from the server.\n"
     ]
    }
   ],
   "source": [
    "# Stop local milvus.\n",
    "!docker compose down\n",
    "\n",
    "# Disconnect from the server.\n",
    "try:\n",
    "  connections.disconnect(alias=\"default\")\n",
    "  print(\"Successfully disconnected from the server.\")\n",
    "except:\n",
    "  pass"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## LangChain <a class=\"anchor\" id=\"langchain\"></a>\n",
    "\n",
    "All 3rd party adapters use [Milvus Lite](https://milvus.io/docs/quickstart.md).  \n",
    "\n",
    "LangChain APIs hide a lot of the steps to convert raw unstructured data into vectors and store the vectors in Milvus.\n",
    "- [LangChain docs](https://python.langchain.com/v0.2/docs/integrations/vectorstores/milvus/)\n",
    "- [Milvus docs](https://milvus.io/docs/integrate_with_langchain.md)\n",
    "\n",
    "LangChain default values:\n",
    "- collection_name: LangChainCollection\n",
    "- schema: ['pk', 'source', 'text', 'vector']\n",
    "- auto_id: True\n",
    "- {'index_type': 'HNSW',\n",
    " 'metric_type': 'L2',\n",
    " 'params': {'M': 8, 'efConstruction': 64}}\n",
    "- consistency_level: 'Session'\n",
    "- overwrite: False"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "# !python -m pip install -U langchain_community unstructured langchain-milvus langchain-huggingface"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "loaded 22 documents\n"
     ]
    }
   ],
   "source": [
    "# UNCOMMENT TO READ WEB DOCS FROM A LOCAL DIRECTORY.\n",
    "\n",
    "# Read docs into LangChain\n",
    "from langchain.document_loaders import DirectoryLoader\n",
    "\n",
    "# Load HTML files from a local directory\n",
    "path = \"RAG/rtdocs_new/\"\n",
    "loader = DirectoryLoader(path, glob='*.html')\n",
    "docs = loader.load()\n",
    "\n",
    "num_documents = len(docs)\n",
    "print(f\"loaded {num_documents} documents\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "length doc: 11016\n",
      "('Why MilvusDocsTutorialsToolsBlogCommunityStars0Try Managed Milvus '\n",
      " 'FREESearchHomev2.4.xAbout MilvusGe')\n"
     ]
    }
   ],
   "source": [
    "# Inspect the first document.\n",
    "import pprint\n",
    "print(f\"length doc: {len(docs[0].page_content)}\")\n",
    "pprint.pprint(docs[0].page_content.replace('\\n', '')[:100])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/opt/miniconda3/envs/py311-unum/lib/python3.11/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.\n",
      "  warnings.warn(\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "EMBEDDING_DIM: 1024\n",
      "Created Milvus collection from 427 docs in 33.14 seconds\n"
     ]
    }
   ],
   "source": [
    "from langchain_milvus import Milvus\n",
    "from langchain_huggingface import HuggingFaceEmbeddings\n",
    "from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
    "import time, pprint\n",
    "\n",
    "# Define the embedding model.\n",
    "model_name = \"BAAI/bge-large-en-v1.5\"\n",
    "model_kwargs = {'device': 'cpu'}\n",
    "encode_kwargs = {'normalize_embeddings': True}\n",
    "embed_model = HuggingFaceEmbeddings(\n",
    "    model_name=model_name,\n",
    "    model_kwargs=model_kwargs,\n",
    "    encode_kwargs=encode_kwargs\n",
    ")\n",
    "EMBEDDING_DIM = embed_model.dict()['client'].get_sentence_embedding_dimension()\n",
    "print(f\"EMBEDDING_DIM: {EMBEDDING_DIM}\")\n",
    "\n",
    "# Chunking\n",
    "text_splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=51)\n",
    "\n",
    "# Create a Milvus collection from the documents and embeddings.\n",
    "start_time = time.time()\n",
    "docs = text_splitter.split_documents(docs)\n",
    "vectorstore = Milvus.from_documents(\n",
    "    documents=docs,\n",
    "    embedding=embed_model,\n",
    "    connection_args={\n",
    "        \"uri\": \"./milvus_demo.db\"},\n",
    "    # Override LangChain default values for Milvus.\n",
    "    consistency_level=\"Eventually\",\n",
    "    drop_old=True,\n",
    "    index_params = {\n",
    "        \"metric_type\": \"COSINE\",\n",
    "        \"index_type\": \"AUTOINDEX\",\n",
    "        \"params\": {},}\n",
    ")\n",
    "end_time = time.time()\n",
    "print(f\"Created Milvus collection from {len(docs)} docs in {end_time - start_time:.2f} seconds\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "collection_name: LangChainCollection\n",
      "schema: ['source', 'text', 'pk', 'vector']\n",
      "auto_id: True\n",
      "{'index_type': 'AUTOINDEX', 'metric_type': 'COSINE', 'params': {}}\n",
      "'consistency: Eventually'\n",
      "'drop_old: True'\n"
     ]
    }
   ],
   "source": [
    "# Describe the collection.\n",
    "print(f\"collection_name: {vectorstore.collection_name}\")\n",
    "print(f\"schema: {vectorstore.fields}\")\n",
    "print(f\"auto_id: {vectorstore.auto_id}\")\n",
    "pprint.pprint(vectorstore.index_params)\n",
    "pprint.pprint(f\"consistency: {vectorstore.consistency_level}\")\n",
    "vectorstore.drop_old = True\n",
    "pprint.pprint(f\"drop_old: {vectorstore.drop_old}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Delete the Milvus collection.\n",
    "del vectorstore"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## LlamaIndex <a class=\"anchor\" id=\"llama_index\"></a>\n",
    "\n",
    "All 3rd party adapters use [Milvus Lite](https://milvus.io/docs/quickstart.md).  \n",
    "\n",
    "LlamaIndex APIs hide a lot of the steps to convert raw unstructured data into vectors and store the vectors in Milvus.\n",
    "- [LlamaIndex docs](https://docs.llamaindex.ai/en/latest/examples/vector_stores/MilvusIndexDemo/)\n",
    "- [Milvus docs](https://milvus.io/docs/integrate_with_llamaindex.md)\n",
    "\n",
    "LlamaIndex default values:\n",
    "- collection_name: llamacollection\n",
    "- schema: ['doc_id', 'embedding']\n",
    "- auto_id: True\n",
    "- {'index_type': 'None',\n",
    " 'metric_type': 'IP',\n",
    "- consistency_level: 'Strong'\n",
    "- overwrite: False"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "# !python -m pip install -U --no-cache-dir llama-index llama-index-embeddings-huggingface llama-index-vector-stores-milvus"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "loaded 22 documents\n"
     ]
    }
   ],
   "source": [
    "# UNCOMMENT TO READ WEB DOCS FROM A LOCAL DIRECTORY.\n",
    "\n",
    "# Read docs into LlamaIndex\n",
    "from llama_index.core import SimpleDirectoryReader\n",
    "\n",
    "# Load HTML files from a local directory\n",
    "# https://docs.llamaindex.ai/en/stable/api_reference/readers/simple_directory_reader\n",
    "# Supposed to automatically parse files based on their extension.\n",
    "path = \"RAG/rtdocs_new/\"\n",
    "loader = SimpleDirectoryReader(\n",
    "        input_dir=path, \n",
    "        required_exts=[\".html\"],\n",
    "        recursive=True # Recursively search subdirectories\n",
    "    )\n",
    "lli_docs = loader.load_data()\n",
    "\n",
    "num_documents = len(lli_docs)\n",
    "print(f\"loaded {num_documents} documents\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "length doc: 663373\n",
      "('<!DOCTYPE html><html lang=\"en\"><head><meta charSet=\"utf-8\"/><meta '\n",
      " 'http-equiv=\"x-ua-compatible\" conte')\n"
     ]
    }
   ],
   "source": [
    "# Inspect the first document.\n",
    "import pprint\n",
    "\n",
    "# html docs were not parsed by SimpleDirectoryReader.\n",
    "print(f\"length doc: {len(lli_docs[0].text)}\")\n",
    "pprint.pprint(lli_docs[0].text[:100])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/var/folders/vn/4v5_m9mx69x3h7jcl1chb7nr0000gn/T/ipykernel_4999/3447014088.py:12: DeprecationWarning: Call to deprecated class method from_defaults. (ServiceContext is deprecated, please use `llama_index.settings.Settings` instead.) -- Deprecated since version 0.10.0.\n",
      "  service_context = ServiceContext.from_defaults(\n",
      "/opt/miniconda3/envs/py311-unum/lib/python3.11/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.\n",
      "  warnings.warn(\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Embedding model:\n",
      "{'cache_folder': None,\n",
      " 'class_name': 'HuggingFaceEmbedding',\n",
      " 'embed_batch_size': 10,\n",
      " 'max_length': 512,\n",
      " 'model_name': 'BAAI/bge-large-en-v1.5',\n",
      " 'normalize': True,\n",
      " 'num_workers': None,\n",
      " 'query_instruction': None,\n",
      " 'text_instruction': None}\n",
      "\n",
      "Start chunking, embedding, inserting...\n",
      "Created LlamaIndex collection from 1 docs in 101.56 seconds\n"
     ]
    }
   ],
   "source": [
    "from llama_index.core import (\n",
    "    Settings,\n",
    "    ServiceContext,\n",
    "    StorageContext,\n",
    "    VectorStoreIndex,\n",
    ")\n",
    "from llama_index.embeddings.huggingface import HuggingFaceEmbedding\n",
    "from llama_index.vector_stores.milvus import MilvusVectorStore\n",
    "import time, pprint\n",
    "\n",
    "# Define the embedding model.\n",
    "service_context = ServiceContext.from_defaults(\n",
    "    # LlamaIndex local: translates to the same location as default HF cache.\n",
    "    embed_model=\"local:BAAI/bge-large-en-v1.5\",\n",
    ")\n",
    "# Display what LlamaIndex exposes.\n",
    "print(\"Embedding model:\")\n",
    "temp = service_context.to_dict()\n",
    "pprint.pprint(temp['embed_model'])\n",
    "print()\n",
    "# LlamaIndex hides this but we need it to create the vector store!\n",
    "EMBEDDING_DIM = 1024\n",
    "\n",
    "# Create a Milvus collection from the documents and embeddings.\n",
    "vectorstore = MilvusVectorStore(\n",
    "    uri=\"./milvus_llamaindex.db\",\n",
    "    dim=EMBEDDING_DIM,\n",
    "    # Override LlamaIndex default values for Milvus.\n",
    "    consistency_level=\"Eventually\",\n",
    "    drop_old=True,\n",
    "    index_params = {\n",
    "        \"metric_type\": \"COSINE\",\n",
    "        \"index_type\": \"AUTOINDEX\",\n",
    "        \"params\": {},}\n",
    ")\n",
    "storage_context = StorageContext.from_defaults(\n",
    "    vector_store=vectorstore\n",
    ")\n",
    "\n",
    "print(f\"Start chunking, embedding, inserting...\")\n",
    "start_time = time.time()\n",
    "llamaindex = VectorStoreIndex.from_documents(\n",
    "    # Too slow!  Just use one document.\n",
    "    lli_docs[:1], \n",
    "    storage_context=storage_context, \n",
    "    service_context=service_context\n",
    ")\n",
    "end_time = time.time()\n",
    "print(f\"Created LlamaIndex collection from {len(lli_docs[:1])} docs in {end_time - start_time:.2f} seconds\")\n",
    "# Created LlamaIndex collection from 1 docs in 106.32 seconds"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "stores_text: True\n",
      "is_embedding_query: True\n",
      "stores_node: True\n",
      "uri: ./milvus_llamaindex.db\n",
      "token: \n",
      "collection_name: llamacollection\n",
      "dim: 1024\n",
      "embedding_field: embedding\n",
      "doc_id_field: doc_id\n",
      "similarity_metric: IP\n",
      "consistency_level: Eventually\n",
      "overwrite: False\n",
      "text_key: None\n",
      "output_fields: []\n",
      "index_config: {}\n"
     ]
    }
   ],
   "source": [
    "# Describe the collection, it looks like the Milvus overrides did not all work.\n",
    "temp = llamaindex.storage_context.vector_store.to_dict()\n",
    "first_15_keys = list(temp.keys())[:15]\n",
    "for key in first_15_keys:\n",
    "    print(f\"{key}: {temp[key]}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Delete the Milvus collection.\n",
    "del llamaindex"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The watermark extension is already loaded. To reload it, use:\n",
      "  %reload_ext watermark\n",
      "Author: Christy Bergman\n",
      "\n",
      "Python implementation: CPython\n",
      "Python version       : 3.11.8\n",
      "IPython version      : 8.22.2\n",
      "\n",
      "pymilvus    : 2.4.3\n",
      "llama_index : 0.10.44\n",
      "langchain   : 0.2.2\n",
      "unstructured: 0.14.4\n",
      "\n",
      "conda environment: py311-unum\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# Props to Sebastian Raschka for this handy watermark.\n",
    "# !pip install watermark\n",
    "\n",
    "%load_ext watermark\n",
    "%watermark -a 'Christy Bergman' -v -p pymilvus,llama_index,langchain,unstructured --conda"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "llama-index                             0.10.44\n",
      "llama-index-agent-openai                0.2.7\n",
      "llama-index-cli                         0.1.12\n",
      "llama-index-core                        0.10.44\n",
      "llama-index-embeddings-huggingface      0.2.1\n",
      "llama-index-embeddings-openai           0.1.10\n",
      "llama-index-indices-managed-llama-cloud 0.1.6\n",
      "llama-index-legacy                      0.9.48\n",
      "llama-index-llms-ollama                 0.1.5\n",
      "llama-index-llms-openai                 0.1.22\n",
      "llama-index-multi-modal-llms-openai     0.1.6\n",
      "llama-index-program-openai              0.1.6\n",
      "llama-index-question-gen-openai         0.1.3\n",
      "llama-index-readers-file                0.1.23\n",
      "llama-index-readers-llama-parse         0.1.4\n",
      "llama-index-vector-stores-milvus        0.1.17\n"
     ]
    }
   ],
   "source": [
    "# Check all llamaindex packages info, make sure they latest.\n",
    "!pip list | grep llama-index"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "py310",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.8"
  },
  "orig_nbformat": 4
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
